language identification
PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection
Rezaabad, Ali Lotfi, Khanal, Bikram, Chaurasia, Shashwat, Zeng, Lu, Hong, Dezhi, Bashashati, Hossein, Butler, Thomas, Ganji, Megan
Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet, existing language identification tools struggle with key cases -- such as music requests where the song title and user language differ. Open-source tools like LangDetect, FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and well-separated embeddings even for closely related languages. Evaluated on two challenging datasets -- Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) -- PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it ideal for compute- and latency-constrained environments.
- Europe > Austria > Vienna (0.14)
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > Dominican Republic (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.68)
AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research
Apostu, Alexandru-Mihai, Preda, Andrei, Damir, Alexandra Daniela, Bolocan, Diana, Ionescu, Radu Tudor, Croitoru, Ioana, Gaman, Mihaela
Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.34)
A Characterization of List Language Identification in the Limit
Charikar, Moses, Pabbaraju, Chirag, Tewari, Ambuj
We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin's characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.
- Asia > Afghanistan > Parwan Province > Charikar (0.40)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution
Intra - sentence multilingual speech synthesis (code - switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed - language contexts. We introduce Script - First Multilingual Synthesis with Adaptive Locale Resolution (SFMS - ALR) an engine - agnostic framework for fluent, real - time code - switched speech generation. SFMS - ALR segments input text by Unicode script, applies adaptive language identification to determine each segment's language and locale, and normalizes prosody using sentiment - aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate
- North America > United States (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech
Bafna, Niyati, Wiesner, Matthew
Prior research indicates that LID model performance significantly declines on accented speech; however, the specific causes, extent, and characterization of these errors remain under-explored. (i) We identify a common failure mode on accented speech whereby LID systems often misclassify L2 accented speech as the speaker's native language or a related language. (ii) We present evidence suggesting that state-of-the-art models are invariant to permutations of short spans of speech, implying they classify on the basis of short phonotactic features indicative of accent rather than language. Our analysis reveals a simple method to enhance model robustness to accents through input chunking. (iii) We present an approach that integrates sequence-level information into our model without relying on monolingual ASR systems; this reduces accent-language confusion and significantly enhances performance on accented speech while maintaining comparable results on standard LID.
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Oceania > Australia (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (6 more...)
Adapting Language Balance in Code-Switching Speech
Ugan, Enes Yavuz, Pham, Ngoc-Quan, Waibel, Alexander
Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation -- the central challenge in code-switching -- thereby improving the model's robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.
- North America > United States (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- Asia > East Asia (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- Asia > Indonesia > Bali (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- (24 more...)
- Law (0.93)
- Information Technology (0.93)
DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching
Ojo, Jessica, Kamel, Zina, Adelani, David Ifeoluwa
Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data. This work introduces DIVERS-BENCH, a comprehensive evaluation of state-of-the-art LID models across diverse domains, including speech transcripts, web text, social media texts, children's stories, and code-switched text. Our findings reveal that while models achieve high accuracy on curated datasets, performance degrades sharply on noisy and informal inputs. We also introduce DIVERS-CS, a diverse code-switching benchmark dataset spanning 10 language pairs, and show that existing models struggle to detect multiple languages within the same sentence. These results highlight the need for more robust and inclusive LID systems in real-world settings.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (21 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Communications > Social Media (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Geolocation-Aware Robust Spoken Language Identification
Wang, Qingzheng, Shim, Hye-jin, Sun, Jiancheng, Watanabe, Shinji
--While Self-supervised Learning (SSL) has significantly improved Spoken Language Identification (LID), existing models often struggle to consistently classify dialects and accents of the same language as a unified class. T o address this challenge, we propose geolocation-aware LID, a novel approach that incorporates language-level geolocation information into the SSL-based LID model. Specifically, we introduce geolocation prediction as an auxiliary task and inject the predicted vectors into intermediate representations as conditioning signals. This explicit conditioning encourages the model to learn more unified representations for dialectal and accented variations. Experiments across six multilingual datasets demonstrate that our approach improves robustness to intra-language variations and unseen domains, achieving new state-of-the-art accuracy on FLEURS (97.7%) and 9.7% relative improvement on ML-SUPERB 2.0 dialect set.
Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech
Shankar, Lavanya, Perera, Leibny Paola Garcia
This paper addresses this challenge by using Zipformer to handle the nuances of speech which contains two imbalanced languages - Mandarin and English - in an utterance. This work demonstrates that the internal layers of the Zipformer effectively encode the language characteristics, which can be leveraged in language identification. We present the selection methodology of the inner layers to extract the em-beddings and make a comparison with different back-ends. Our analysis shows that Zipformer is robust across these backends. Our approach effectively handles imbalanced data, achieving a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the language identification baseline.
- North America > United States (0.04)
- Asia > East Asia (0.04)